I. Introduction

Gross domestic product (GDP) is one of the most common indicators used to track the health of a nation’s economy. GDP per capita, calculated by dividing a country’s GDP by its total population, is generally accepted as a measure of the standard of living. This study examines the relationship between the average person’s standard of living in a country and several other social indicators, with the aim of building a model with reliable predictive ability.

The data used in this study were obtained from the World Bank website and include the GDP per capita, unemployment rate, urban population percentage, and high-tech exports of each country in 2019, together with the middle school enrollment rates in 2000. Data for 2020 were available; however, the author chose not to base the study on that year because it was an economic and social outlier for the world, and thus should not be used in a study that aims to understand such relationships under normal conditions.

Four predictor variables were chosen for this study. The yearly unemployment rate of a country was considered since a country is likely to be more successful economically with a larger workforce. The urban population percentage was also considered because economic growth usually comes with urbanization, so the two are expected to be positively correlated. High-tech exports were chosen because they indicate whether a country is industrialized and has thus experienced substantial economic growth. Data for these three indicators are from the same year as the variable of interest, GDP per capita. Only the middle school enrollment rate is taken from a different year, 2000: with a time lag of 19 years, the people who were attending middle school then are now a major part of the workforce, the very group that generates most of the nation’s wealth.

After organizing the data and removing countries with missing values and non-country/territory observations, 71 countries remain in the dataset, out of the 195 countries in the world (approximately 36.4%). We recognize that this is a considerable loss of data, which could introduce bias and reduce the generalizability of our findings.

Table 1. Sample of 5 randomly chosen countries from the data set used in this study
Country         GDP2019    MidSchool2000  Unemployment2019  HighTech2019  UrbanPopPercentage2019
Korea, Rep.     31846.218  92.63574       4.148000          153561173548  81.43000
Ecuador         6183.824   47.76897       3.968000          68045527      63.98600
United Kingdom  42330.118  94.68536       3.851000          78176113113   83.65200
North America   63344.078  86.91654       3.889744          188786437936  82.36169
Zimbabwe        1463.986   42.45882       4.954000          27810712      32.21000

II. Exploratory data analysis


Table 2: Summary for the GDP per capita
n   min       median    mean      max       sd
71  411.5523  11611.42  22962.63  114704.6  23786.58

Our total sample size is 71 (Table 2). The mean GDP per capita is about 22,962.63, far greater than the median of 11,611.42, indicating that the distribution of GDP per capita is heavily right-skewed and may be affected by outlying observations, as can be seen in Figure 1. This is understandable because global wealth is not distributed evenly: some countries hold significantly more wealth than others.
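The gap between the mean and the median can be turned into a rough skewness number. The Python sketch below is an illustration rather than part of the original analysis: it applies Pearson's median skewness coefficient, 3 * (mean - median) / sd, to the values in Table 2. The choice of this particular coefficient is an assumption made for the example; a positive value points to right skew.

```python
# Summary statistics copied from Table 2 (GDP per capita, n = 71).
mean_gdp = 22962.63
median_gdp = 11611.42
sd_gdp = 23786.58


def pearson_median_skewness(mean, median, sd):
    """Pearson's median skewness coefficient: 3 * (mean - median) / sd."""
    return 3.0 * (mean - median) / sd


# Ratio of mean to median: values well above 1 suggest a long right tail.
mean_to_median = mean_gdp / median_gdp

# Positive skewness coefficient means the mean is pulled above the median.
skew = pearson_median_skewness(mean_gdp, median_gdp, sd_gdp)

print(f"mean / median ratio: {mean_to_median:.2f}")
print(f"Pearson median skewness: {skew:.2f}")
```

Both numbers agree with the visual impression of a heavily right-skewed distribution.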

Figure 1. Distribution of the GDP per capita for individual countries in 2019


Figure 2 shows the distribution of unemployment rates in 2019, which is also right-skewed and has some extreme outliers at around 20-30%. In Figure 3, the middle school enrollment rate in 2000 has a left-skewed distribution; however, the left tail is heavy, so the low values cannot be considered outliers.

Figure 2. Distribution of the unemployment rate for individual countries in 2019


Figure 3. Distribution of the middle school enrollment rate for individual countries in 2000


Figure 4. Distribution of the high-tech exports for individual countries in 2019


In Figures 4 and 5, the distribution of high-tech exports is extremely right-skewed with many outlying observations, while the urban population percentage is only slightly left-skewed with no obvious outliers.

Figure 5. Distribution of the urban population percentage for individual countries in 2019


Figure 6. Interactive Scatterplot for the GDP per capita in 2019 for individual countries against their urban population percentage in the same year. The red line is the best fit line. The blue curve is the Loess curve.

In Figure 6, the scatterplot suggests a positive correlation between GDP per capita and the urban population percentage. Without implying any causal effect, countries with a higher average standard of living tend to have a higher proportion of their people living in urban areas.

The scatterplot in Figure 7 suggests that the unemployment rate and GDP per capita are negatively correlated. More notably, the purple points cluster at the top whereas the yellow points lie mostly at the bottom. This implies that countries that had high middle school enrollment rates in 2000 also have a higher standard of living 19 years later. This is better illustrated in Figure 8, where we also notice that an upward-curving line would fit this relationship better than a straight line.

Figure 7. Interactive Scatterplot for the GDP per capita in 2019 for individual countries against their unemployment rates of the same year. The red line is the best fit line. The blue curve is the Loess curve.

Figure 8. Interactive Scatterplot for the GDP per capita for individual countries against their middle school enrollment rates in the year 2000. The red line is the best fit line. The blue curve is the Loess curve.

Figure 9. Interactive Scatterplot for the GDP per capita 2019 for individual countries against their High-tech Exports. The red line is the best fit line. The blue curve is the Loess curve.


III. Multiple linear regression

i. Methods


Since the exploratory analysis shows that the distribution of GDP per capita is right-skewed and has some outliers, we decided to transform the response to address this problem. We also recognize the danger of overfitting, so we do not use the Box-Cox procedure to optimize the transformation for this data set, but rather use a more “natural” transformation: the square root.
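To see why a square root helps with right skew, the Python sketch below (an illustration, not the original analysis code) applies it to the minimum, median, and maximum from Table 2. Because the square root is monotone and n = 71 is odd, these three quantiles of the transformed data are simply the square roots of the originals. The "tail ratio" used here, (max - median) / (median - min), is an ad hoc asymmetry measure assumed for this example; a value near 1 would indicate a symmetric range.

```python
import math

# Quantiles of GDP per capita copied from Table 2.
gdp_min, gdp_median, gdp_max = 411.5523, 11611.42, 114704.6


def tail_ratio(low, mid, high):
    """Length of the upper range relative to the lower range (1 = symmetric)."""
    return (high - mid) / (mid - low)


# Asymmetry on the original scale.
ratio_before = tail_ratio(gdp_min, gdp_median, gdp_max)

# Asymmetry after the square-root transformation of the same three quantiles.
sqrt_min, sqrt_median, sqrt_max = (math.sqrt(v) for v in (gdp_min, gdp_median, gdp_max))
ratio_after = tail_ratio(sqrt_min, sqrt_median, sqrt_max)

print(f"tail ratio before transformation: {ratio_before:.2f}")
print(f"tail ratio after square root:     {ratio_after:.2f}")
```

The upper range is still longer than the lower one after the transformation, which is consistent with the remark that the result is improved but not perfect.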

Figure 10. Distribution of GDP per capita in 2019 raised to 0.5, for individual countries


Using the following model:

## lm(formula = GDP2019_transf ~ HighTech2019 + ns(UrbanPopPercentage2019, 
##     df = 3) + ns(MidSchool2000, df = 3) + ns(Unemployment2019, 
##     df = 3), data = tidy_joined_dataset)

We decided to keep the high-tech exports variable linear because it is extremely right-skewed and does not have the spread needed for flexible alternatives (such as natural splines or polynomials). For every other variable (unemployment rate, middle school enrollment rate, and urban population percentage) we used natural splines. Each spline uses 4 knots (df = 3), chosen in view of the small sample size (<100).
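The link between df = 3 and 4 knots can be sketched as follows. This Python illustration assumes the knot placement documented for R's ns() when only df is given: df - 1 interior knots at equally spaced quantiles of the predictor, plus two boundary knots at its minimum and maximum. The five urban population values are the Table 1 sample, used only as stand-in data, and the helper names are hypothetical.

```python
def quantile_type7(values, p):
    """Linear-interpolation quantile (the default 'type 7' rule in R's quantile())."""
    xs = sorted(values)
    h = (len(xs) - 1) * p
    lo = int(h)                      # index of the order statistic at or below h
    hi = min(lo + 1, len(xs) - 1)    # next order statistic, capped at the last one
    return xs[lo] + (h - lo) * (xs[hi] - xs[lo])


def natural_spline_knots(values, df):
    """Interior and boundary knots for a natural spline with no intercept column."""
    n_interior = df - 1
    probs = [(i + 1) / (n_interior + 1) for i in range(n_interior)]
    interior = [quantile_type7(values, p) for p in probs]
    boundary = (min(values), max(values))
    return interior, boundary


# Urban population percentages of the five countries shown in Table 1 (illustrative only).
urban_sample = [81.43, 63.986, 83.652, 82.36169, 32.21]
interior, boundary = natural_spline_knots(urban_sample, df=3)
total_knots = len(interior) + len(boundary)

# Coefficient count for the full model: intercept + linear term + three splines with df = 3.
n_coefficients = 1 + 1 + 3 * 3
residual_df = 71 - n_coefficients

print("interior knots:", interior)
print("boundary knots:", boundary)
print("total knots:", total_knots, "| residual df:", residual_df)
```

The residual degrees of freedom obtained this way match the 60 reported with the model summary.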

After the square root transformation, the diagnostic plots, though not perfect, are more promising. In Figures 11, 12, and 13, the normal Q-Q plot shows an almost straight line and the distribution of the error terms is more symmetric; however, the residual scatterplot does appear to violate the homoscedasticity assumption.

Figure 11. Normal Q-Q plot for the square root of GDP per capita in 2019


In Table 3, the GVIF values for the variables with 1 degree of freedom and the GVIF^(1/(2*Df)) values for the variables with more than 1 degree of freedom are all between 1 and 5, which indicates at most moderate correlation between the predictor variables. Since multicollinearity is limited, the statistical power of the model is not greatly reduced, and we can proceed with the desired analysis.

Table 3: VIF table
Term                                GVIF      Df  GVIF^(1/(2*Df))
HighTech2019                        1.181734  1   1.087076
ns(UrbanPopPercentage2019, df = 3)  2.175815  3   1.138336
ns(MidSchool2000, df = 3)           4.056648  3   1.262878
ns(Unemployment2019, df = 3)        2.894752  3   1.193810
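The last column of Table 3 is a direct function of the first two. The short Python sketch below (an illustration, not the original code) recomputes GVIF^(1/(2*Df)) from the GVIF and Df values in the table, which makes terms with different degrees of freedom comparable on a common scale.

```python
# (GVIF, Df) pairs copied from Table 3.
gvif_table = {
    "HighTech2019": (1.181734, 1),
    "ns(UrbanPopPercentage2019, df = 3)": (2.175815, 3),
    "ns(MidSchool2000, df = 3)": (4.056648, 3),
    "ns(Unemployment2019, df = 3)": (2.894752, 3),
}

# Adjust each GVIF for the number of coefficients the term contributes.
adjusted_gvif = {
    term: gvif ** (1.0 / (2 * df))
    for term, (gvif, df) in gvif_table.items()
}

for term, value in adjusted_gvif.items():
    print(f"{term:<36} {value:.6f}")
```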

ii. Model Results and Interpretation

Our model is the following:

## lm(formula = GDP2019_transf ~ HighTech2019 + ns(UrbanPopPercentage2019, 
##     df = 3) + ns(MidSchool2000, df = 3) + ns(Unemployment2019, 
##     df = 3), data = tidy_joined_dataset)

Given the nature of splines, we do not interpret the individual model coefficients: each spline variable enters the model through several basis terms, so one term cannot change while all else stays fixed when predicting the average square root of GDP per capita. Instead, we examine the coefficients and their relative significance in the ANOVA table analysis section.

The coefficient p-values in Table 4 show that the third basis term of the urban population percentage spline and the first and third basis terms of the middle school enrollment spline have p-values < 0.05. The first basis terms of the urban population percentage and unemployment splines are significant only at the 0.10 level (p = 0.0854 and p = 0.0961), whereas high-tech exports (p = 0.3708) and the remaining spline terms have p-values > 0.10.


Table 4. Model Summary Table
Term                                 Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)                          53.4012   28.5225     1.8722   0.0660
HighTech2019                         0.0000    0.0000      0.9017   0.3708
ns(UrbanPopPercentage2019, df = 3)1  43.7492   25.0161     1.7488   0.0854
ns(UrbanPopPercentage2019, df = 3)2  11.6767   51.7280     0.2257   0.8222
ns(UrbanPopPercentage2019, df = 3)3  77.3089   25.0750     3.0831   0.0031
ns(MidSchool2000, df = 3)1           86.5470   26.6525     3.2472   0.0019
ns(MidSchool2000, df = 3)2           75.1719   78.5516     0.9570   0.3424
ns(MidSchool2000, df = 3)3           148.4772  25.9431     5.7232   0.0000
ns(Unemployment2019, df = 3)1        -54.5920  32.2892     -1.6907  0.0961
ns(Unemployment2019, df = 3)2        47.3276   68.1545     0.6944   0.4901
ns(Unemployment2019, df = 3)3        46.1235   39.6449     1.1634   0.2493

Statistic                Value   df
Residual Standard Error  43.042  60
Multiple R-squared       0.714
Adjusted R-squared       0.666

Statistic          Value      Numerator df  Denominator df
Model F-statistic  14.97      10            60
P-value            6.338e-13

However, what matters is whether the model as a whole is useful. With an adjusted R-squared of 0.666, the model explains a large share of the variability in GDP per capita raised to the power of 0.5. Coupled with the significance of several predictors and the low model p-value of 6.338e-13, this leads us to believe the model is helpful in its explanatory ability.
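The adjusted R-squared and the model F-statistic both follow from the multiple R-squared, the sample size, and the number of model terms. The Python sketch below (an illustration, not the original code) uses the standard textbook formulas with n = 71, p = 10 non-intercept coefficients, and the rounded R-squared of 0.714 from Table 4, so the F value only agrees with the reported 14.97 up to rounding.

```python
# Inputs taken from Table 4 and the sample size reported in the text.
n_obs = 71          # countries in the data set
n_terms = 10        # non-intercept coefficients (numerator df of the F-test)
r_squared = 0.714   # multiple R-squared, rounded to three decimals in the report

residual_df = n_obs - n_terms - 1

# Adjusted R-squared penalises R-squared for the number of fitted terms.
adj_r_squared = 1.0 - (1.0 - r_squared) * (n_obs - 1) / residual_df

# Overall F-statistic: explained variance per term over unexplained variance per residual df.
f_statistic = (r_squared / n_terms) / ((1.0 - r_squared) / residual_df)

print(f"residual df:        {residual_df}")
print(f"adjusted R-squared: {adj_r_squared:.3f}")
print(f"model F-statistic:  {f_statistic:.2f}")
```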


iii. Inference for multiple regression

From the ANOVA table in Table 6, high-tech exports, with 1 degree of freedom, adds 10866.301 to the sum of squares. With an F value of 5.8654 and a p-value of 0.0185, we conclude that high-tech exports alone in the model explains a significant amount of variability.

The urban population percentage variable, with 4 knots and 3 degrees of freedom, adds a further 163412.005 to the sum of squares. With an F value of 29.4020 and a p-value < 0.0001, we conclude that the urban population percentage variable, given that high-tech exports is already in the model, is statistically significant.

The middle school enrollment rate variable, with 4 knots and 3 degrees of freedom, adds a further 94844.386 to the sum of squares. With an F value of 17.0649 and a p-value < 0.0001, we conclude that the middle school enrollment rate variable, given that high-tech exports and the urban population percentage spline are already in the model, is statistically significant.

The unemployment rate variable, with 4 knots and 3 degrees of freedom, adds a further 8154.963 to the sum of squares. With an F value of 1.4673 and a p-value of 0.2325, we conclude that the unemployment rate variable, given that high-tech exports, the urban population percentage spline, and the middle school enrollment spline are already in the model, is not statistically significant.

Table 6. ANOVA Table
Term                                Df  Sum Sq      Mean Sq    F value  Pr(>F)
HighTech2019                        1   10866.301   10866.301  5.8654   0.0185
ns(UrbanPopPercentage2019, df = 3)  3   163412.005  54470.668  29.4020  0.0000
ns(MidSchool2000, df = 3)           3   94844.386   31614.795  17.0649  0.0000
ns(Unemployment2019, df = 3)        3   8154.963    2718.321   1.4673   0.2325
Residuals                           60  111156.967  1852.616   NA       NA
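Each F value in Table 6 is the term's mean square divided by the residual mean square. The Python sketch below (an illustration, not the original code) rebuilds the F column and the residual standard error from the Df and Sum Sq columns alone.

```python
import math

# (Df, Sum Sq) pairs copied from Table 6, in the order the terms enter the model.
anova_terms = {
    "HighTech2019": (1, 10866.301),
    "ns(UrbanPopPercentage2019, df = 3)": (3, 163412.005),
    "ns(MidSchool2000, df = 3)": (3, 94844.386),
    "ns(Unemployment2019, df = 3)": (3, 8154.963),
}
residual_df, residual_ss = 60, 111156.967

# Residual mean square: the common denominator of every F value.
residual_ms = residual_ss / residual_df

# Sequential F value for each term: (Sum Sq / Df) / residual mean square.
f_values = {
    term: (sum_sq / df) / residual_ms
    for term, (df, sum_sq) in anova_terms.items()
}

# The residual standard error is the square root of the residual mean square.
residual_se = math.sqrt(residual_ms)

for term, f_value in f_values.items():
    print(f"{term:<36} F = {f_value:.4f}")
print(f"residual standard error = {residual_se:.3f}")
```

Because the sums of squares are sequential, each F value measures what a term adds beyond the terms listed above it, so the values would change if the terms entered in a different order.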

The 95% prediction intervals: for a country with an urban population of 0%, an unemployment rate of 5%, a middle school enrollment rate in 2000 of 60%, and high-tech exports worth $10,000,000, the predicted square root of GDP per capita is 98.09265, with a lower limit of -18.23444 and an upper limit of 214.4197.

Holding the unemployment rate at 5%, the middle school enrollment rate in 2000 at 60%, and high-tech exports at $10,000,000, Table 7 also shows the predicted square root of GDP per capita for urban population percentages of 40, 50, 60, and 70%.

Table 7. The 95% prediction intervals for the square root of GDP per capita at urban population percentages of 0, 40, 50, 60, and 70, with unemployment rate = 5, middle school enrollment rate = 60, and high-tech exports = 10,000,000 USD
UrbanPopPercentage2019  Point Estimate  Lower Limit  Upper Limit
0                       98.09265        -18.23444    214.4197
40                      56.62910        -35.61627    148.8745
50                      56.75335        -35.23696    148.7437
60                      66.94692        -23.97692    157.8708
70                      86.77952        -4.18649     177.7455
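The intervals in Table 7 are on the square-root scale. One possible way to read them on the original GDP per capita scale is sketched below in Python; this back-transformation is not part of the original analysis. It assumes that negative limits can be truncated at 0 (a square root cannot be negative) and that squaring the endpoints is acceptable because squaring is monotone on non-negative values. The squared point estimate is a back-transformed prediction, not an estimate of mean GDP per capita.

```python
# Rows copied from Table 7: (urban %, point estimate, lower limit, upper limit),
# all on the square-root-of-GDP-per-capita scale.
sqrt_scale_intervals = [
    (0, 98.09265, -18.23444, 214.4197),
    (40, 56.62910, -35.61627, 148.8745),
    (50, 56.75335, -35.23696, 148.7437),
    (60, 66.94692, -23.97692, 157.8708),
    (70, 86.77952, -4.18649, 177.7455),
]


def back_transform(sqrt_value):
    """Map a square-root-scale value to the GDP per capita scale, truncating at 0."""
    return max(sqrt_value, 0.0) ** 2


# Apply the same mapping to the point estimate and both limits of every row.
gdp_scale_intervals = [
    (urban, back_transform(point), back_transform(lower), back_transform(upper))
    for urban, point, lower, upper in sqrt_scale_intervals
]

for urban, point, lower, upper in gdp_scale_intervals:
    print(f"urban = {urban:>2}%: point = {point:9.1f}, interval = [{lower:.1f}, {upper:.1f}]")
```

Every lower limit truncates to 0 under this reading, which shows how wide the intervals are relative to the predictions.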

IV. Discussion

i. Conclusions

We recognize that interpretability must sometimes be traded for a better-fitting model. Our analysis shows that the proposed model seems helpful, as it explains a good share of the variability in the square root of GDP per capita in 2019 (adjusted R-squared of 66.6%).

ii. Limitations

This project is limited by the data available. The chosen combination of indicators reduced the usable countries to only 36.4%, because countries with a missing value in any of the variables were excluded. Additionally, there were some notable outliers and high-leverage points that could not be removed because they were not errors and are thus legitimate observations.

The choice to use a non-linear model made the interpretation of the relationship between the variables more complex and less straightforward, which is a trade-off that the author is well aware of.

The study did not include any test for overfitting, so we do not know how the proposed model will perform outside of this data set.


iii. Further questions


V. Citations and References